Framework for managing cluster membership in a multiprocessor system



(57) A shared-disk cluster system includes a cluster
membership manager framework which coordinates the
joining or leaving among all nodes in a cluster, including
taking the multiple layers of involved subsystems
through transitions. Subsystems are notified of transitions
in a particular order depending upon the transition,
and all nodes' subsystems receiving a notification must
process that notification prior to another layer of subsystems
being notified. One of the subsystems registered
for notification is an event manager in user space. The
event manager carries out transfers of client services,
including user applications, resulting from nodes joining
and leaving the cluster. This includes a registration and
launch service which registers a node, or multiple
nodes, in a cluster which claims, or is assigned, responsibility
for the service and provides an optional launching
function which initiates the client service upon successful
registration.
This invention relates generally to multiprocessor systems and, more particularly, to shared-disk cluster systems.
More particularly, the invention relates to a framework for joining and disjoining nodes in a multiprocessor cluster
system.

A multiprocessor cluster system typically includes multiple nodes, which are interconnected with a private
communication interconnect. The cluster system additionally includes a shared cluster resource, such as a virtual hard
disk, which is accessible to all of the nodes, which run an operating system supporting coordinated access to the
shared resource. Cluster systems have many advantages. They provide high availability to the user because availability
does not depend upon all of the nodes being active participants in the cluster. One or more nodes may leave the cluster
without necessarily affecting availability. New nodes may be added to the system without requiring that the system be
taken down and rebooted. Additionally, nodes may incorporate processor designs that are different from one another,
which facilitates expansion of the system. In this manner, the cluster system provides high aggregate performance.
Shared-disk cluster systems have typically been used for database services which require a distributed lock system
in order to avoid contamination of data on the shared virtual disk. Membership management in such a cluster system
required providing cluster awareness to the distributed lock system. However, such shared-disk cluster systems have
been limited because cluster awareness extends to only one layer of subsystem. Particular operating systems have
multiple subsystems which are layered in a manner that a higher level subsystem must depend upon the operation of
lower level subsystems. Known cluster membership management techniques are not capable of taking such layered
subsystems through cluster transitions of nodes joining and leaving the cluster.

Client services are typically distributed among the nodes of the cluster, requiring extensive coordination of which
node implements which service. This is especially difficult during node transitions of a node joining or leaving the
cluster. This is because most services are not aware of the cluster environment. The client services would typically
determine on their own the best node to execute on. A recovery mechanism would be required for initiating recovery
if the node currently executing the service leaves the cluster. Allowing individual services to implement their own
mechanism for this coordination requires detailed modifications to the client services to allow them to run on a cluster system,
which makes administration of the cluster more burdensome and difficult because inconsistent mechanisms may be

The invention in its various aspects is defined in the independent claims below to which reference should now be 
made. Advantageous features are set forth in the appendant claims. 

A preferred embodiment of the invention, described in more detail below with reference to the drawings, provides
a method and apparatus for combining particular processors, or nodes, of a multiprocessor system in a cluster that
appears substantially as a unified processor to users of the system. Multiple subsystems running on nodes presently
in the cluster are notified of transitions of nodes joining and leaving the cluster. This provides a consistent view of active
membership in the cluster to the subsystems of the cluster nodes whereby all of the node's subsystems may be taken
through the node transitions. This feature is particularly useful with subsystems that are interdependent in levels, with
higher level subsystems depending on the operation of lower level subsystems. A particular transition is noticed to the
same level subsystem on all nodes. Notification will not proceed to another subsystem level until the noticed subsystem
of each node processes that notification and acknowledges that such processing has been completed. When the
transition is a node joining the cluster, subsystems are notified beginning with lower level subsystems and proceeding in
sequence through higher levels of subsystems. When the transition is a node gracefully leaving the cluster, subsystems
are notified beginning with higher level subsystems and proceeding in sequence through lower level subsystems. When
the transition is a node being ungracefully forced from the cluster by other nodes, subsystems are notified beginning
with lower level subsystems and proceeding in sequence through higher level subsystems.
A registration and launch function is provided in which client services, including user applications, are initiated on
[...] as a uniform unit to the client services.

A node is chosen for each client service and that client service is registered with the node. Nodes presently in the
cluster are notified that the particular service is registered with the particular node. In this manner, client services can
be transferred to another node if the node on which that service is registered leaves the cluster. The client service may
be launched on a node, according to an action parameter included with the service, in response to registering that

[...] themselves each time they are transferred. Client services may, advantageously, be grouped as a parent service
[...] action parameters
included with the parent service for all launching activity within the group. The choosing of a node for each client service
may include providing a database of choosing factors for the client service and applying the choosing factors to [...]

Such registration and launch function is preferably a component of an event manager, which is a subsystem which 
receives notification of node transitions from the cluster membership manager. The event manager monitors client
services registered with a particular node using an event watcher and provides action items which are carried out in 
response to occurrence of an event, such as a node transition. The event watcher may be enabled in response to 
registering of a client service and disabled in response to de-registering of the client service. 
The preferred embodiment of the invention will now be described in more detail, by way of example, with reference
to the drawings, in which:

Fig. 1 is a block diagram of a multiprocessor cluster system embodying the invention;
Fig. 2 is a state transition diagram of a transition notification framework for one subsystem level;
Figs. 3-10 are diagrams of states of subsystems in a two-node cluster illustrating nodes joining the cluster;
Figs. 11 and 12 are diagrams of states of subsystems in a two-node cluster illustrating a graceful leave of a node
from the cluster;
Figs. 13-15 are diagrams of states of subsystems in a two-node cluster illustrating an ungraceful forced leave of
a node from the cluster;
Fig. 16 is a block diagram illustrating the grouping of client services;
Fig. 17 is a diagram similar to Fig. 15 illustrating multiple generations of client service groupings;
Fig. 18 is a state transition diagram illustrating the launching of a client service;
Fig. 19 is a state transition diagram illustrating the transition states of a client service;
Fig. 20 is similar to Fig. 18 illustrating additional transition states; and
Fig. 21 is a block diagram of an event manager subsystem.

HARDWARE 

Referring now specifically to the drawings, and the illustrative embodiments depicted therein, a multiprocessor
cluster system 25 includes multiple nodes 26 and a shared-cluster resource, such as a physical disk 28, which could
be made up of multiple physical disk drives (Fig. 1). Each node 26 includes a processor (CPU), physical memory
caches, shared and private bus interfaces, and optional dedicated devices. Each node runs a copy of a UNIX-based
operating system, such as the DG/UX 5.4 operating system marketed by Data General Corporation of Westboro,
Massachusetts, running on any hardware configuration which supports such operating system. An example of such hardware
configuration is the AViiON® family marketed by Data General.

Cluster system 25 additionally includes an interconnect 36, which is a dedicated shared-cluster communication
medium that allows nodes 26 to talk directly to all other nodes in the same cluster, and a shared-cluster I/O bus 32, which
allows all nodes to share all devices physically connected to the shared bus, such as disk 28. In the illustrated
embodiment, shared bus 32 is a SCSI standard bus.


SOFTWARE 

Cluster system 25 includes a single membership database 34, which occupies a dedicated shared-cluster virtual
disk, which lives on physical disk 28 along with a cluster-cognizant bootstrap 38. Membership database 34 manages
persistent node configuration information 40 that is needed to boot, shut down, or panic a node 26. Such persistent
information includes identification of the number of nodes configured with the system, as well as configuration
information about each node, indexed by a node identification number. Membership database 34 additionally includes an
active membership state database 42, which contains transient information about node states. Such transient
information changes dynamically as nodes join the cluster gracefully, gracefully leave the cluster, or are ungracefully forced
out of the cluster. A node can have any one of the following states:

Inactive - The node is not configured or is not an active member of the cluster.

Joining - The node is in the process of joining the cluster, which implies that the node has informed other nodes
of its intention to join the cluster gracefully, but not all of the registered subsystems of nodes in the cluster have
completed transitions to gracefully include the new node.

Joined - The node has fully joined the cluster and all registered subsystems of nodes in the cluster accept the new
node as a member of the cluster and have completed their transitions to include the new node.

Leaving - The node is in the process of leaving the cluster, which implies that the node has informed other nodes of
its intention to leave the cluster gracefully, but not all registered subsystems of nodes in the cluster have
completed transitions to gracefully exclude the leaving node.

Forced-Leaving - Other nodes are in the process of forcing this node out of the cluster. Other nodes may force out
a node if that node is not functioning properly, such as failing to communicate with other nodes. After the other
nodes have completed processing of the forced-leave, which includes running recovery procedures, the other
nodes mark this node as inactive. The forced-out node will panic after it has noticed that the other nodes have
forced it out. A node panics by halting further processing in order to avoid corrupting shared cluster resources.

A. MEMBERSHIP MANAGER

Cluster system 25 includes a membership manager framework including a transition notification framework 44,
which provides notifications to all kernel-level and user-level subsystems that must receive notifications of cluster
transitions (Figs. 2-15). The purpose of transition notification framework 44 is to provide cluster-cognizant subsystems
a coherent technique for processing cluster transition information among the nodes. Cluster-cognizant subsystems
are subsystems which are registered with a node's cluster membership manager subsystem 46. In the illustrated
embodiment, each node 26 includes four kernel-level subsystems, including cluster membership manager subsystems 46,
which collectively provide [...] DLM subsystem 48, a virtual
disk manager (VDM) subsystem 50, and a shared file system (SFS) subsystem 52. Each node 26 additionally includes
at least one user-level subsystem; namely, an event manager subsystem 54. Such subsystems 46-54 are
interdependent upon each other in levels. In the illustrated embodiment, membership manager 46 is the lowest level subsystem
and event manager 54 is the highest level subsystem. However, other higher level subsystems could be provided. A
global transition ordering is provided for the subsystems, with lower level subsystems receiving smaller values and
higher level subsystems receiving larger values.
Transition notification framework 44 operates as follows. Before a node joins a cluster, interested subsystems of
that node register their intention to receive notifications of cluster transitions. A registered subsystem must also supply
a thread of control that blocks waiting for transition notifications from membership manager subsystem 46 in that node.
During graceful joins and forced leaves of nodes, all nodes' membership manager subsystems 46 coordinate to notify
the node's registered subsystems in a bottom-up fashion with respect to the global transition-ordering scheme [...].
The membership manager subsystems notify, first, all of the node's subsystems
with the lowest order, followed by the next highest order, on up to the highest order. Conversely, during graceful leaves
of nodes, all nodes' membership manager subsystems 46 coordinate to notify the node's registered subsystems in a
top-down fashion [...]: first the subsystems with the highest order, followed by the next highest order, on down
to the lowest order. This ordering is so that higher level subsystems' dependencies on lower level subsystems are
satisfied. That is, a lower level subsystem first processes a node join transition so that higher level subsystems can
be ensured that the subsystems they depend upon, namely, lower level subsystems, are aware of and have completed
processing of the join. Conversely, a higher level subsystem must first process a graceful leave so that the lower level
subsystems remain operational in the leaving node during the leave transition. An ungraceful leave is processed from
the bottom up to ensure that all error conditions are propagated upward before attempting recovery at the next highest
level.
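The ordering rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `Transition`, `SUBSYSTEM_ORDER`, and `notification_order` are hypothetical, and the numeric order values are assumed only to respect the lowest-to-highest layering given in the text.

```python
from enum import Enum


class Transition(Enum):
    JOIN = "graceful join"
    LEAVE = "graceful leave"
    FORCED_LEAVE = "ungraceful forced leave"


# Assumed global transition ordering: smaller value = lower-level subsystem.
# The membership manager itself issues the notifications, so only the
# subsystems it notifies are listed (hypothetical values).
SUBSYSTEM_ORDER = {"DLM": 1, "VDM": 2, "SFS": 3, "EVENT_MANAGER": 4}


def notification_order(transition, registered=SUBSYSTEM_ORDER):
    """Return subsystem names in the order they would be notified."""
    bottom_up = sorted(registered, key=registered.get)
    if transition is Transition.LEAVE:
        # Graceful leave: higher levels first, so lower levels remain
        # operational while the levels above them finish.
        return list(reversed(bottom_up))
    # Graceful join and forced leave: lower levels first.
    return bottom_up
```

For example, `notification_order(Transition.LEAVE)` yields the event manager first and the DLM last, mirroring the graceful-leave sequence described in the text.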
Each node's membership manager subsystem will not proceed with notification to the next-in-line subsystem until
each node's currently-in-line subsystem acknowledges its completion of processing for the transition. However, each
[...] may [...] a registered subsystem to process multiple transitions for different
nodes at the same time. Each of these transitions may be of a different type. This improves performance in situations
where many nodes are undergoing transitions contemporaneously, such as when many nodes boot after a power
failure that has powered down the entire cluster. However, each node's membership manager will not notify subsystems
out-of-order for a particular transitional node. As a result, multiple transitions for different nodes may be processed
[...] system ordering for each transitional node.
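One way to picture this rule, that transitions for different nodes may overlap while each transitional node still advances strictly in subsystem order, is the following Python sketch. The class `TransitionTracker` and its methods are hypothetical, and a single `acknowledge` call stands in for the acknowledgements of all nodes' subsystems at that level.

```python
class TransitionTracker:
    """Tracks, per transitional node, which subsystem level is in line."""

    def __init__(self, levels):
        self.levels = list(levels)  # levels in notification order
        self.position = {}          # transitional node -> index of level in line

    def begin(self, node):
        # Several nodes may be in transition at once.
        self.position[node] = 0

    def current_level(self, node):
        index = self.position[node]
        if index == len(self.levels):
            return None  # this node's transition has finished
        return self.levels[index]

    def acknowledge(self, node, level):
        # Only the currently-in-line level may complete for this node.
        if level is None or level != self.current_level(node):
            raise ValueError("out-of-order acknowledgement for " + node)
        self.position[node] += 1
```

Each transitional node keeps its own position, so progress for one node neither blocks nor reorders another node's transition.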

For examples of use of cluster subsystems to participate in graceful joins, graceful leaves, and ungraceful forced
leaves utilizing transition notification framework 44, reference is made to Figs. 2-15, which illustrate a cluster system
having potentially a two-node cluster. The examples illustrated in Figs. 3-15 may be generalized to three or more nodes,
with each node transition sequencing through each subsystem, one at a time, across all nodes. Each membership
manager would not propagate the same node transition to the next highest subsystem until all nodes at the current
level have acknowledged their completion of transition processing for the new node.

[...] system administrator powers node
[...] node N0 [...] to gracefully join the cluster as the first active member.
[...] kernel subsystems 46-52, which each register themselves for transition notification (Fig. 4). The subsystems spawn a
thread that makes a kernel call which blocks because no cluster transitions have occurred at this time. Node N0's INIT
subsystem (not shown) initiates node N0's graceful join through the highest currently registered subsystem.
Membership manager subsystem 46 of node N0 forms the cluster and marks node N0's active state as joining. Membership
manager 46 of node N0 notifies its DLM subsystem 48 that node N0 is joining the cluster (Fig. 5). The thread of DLM
subsystem 48 of node N0 is awakened, notices node N0's new joining state, hands the joined processing off to a
different DLM thread, then completes its join. Node N0's DLM subsystem marks node N0's state as joined and informs
membership manager 46 of node N0 that DLM has completed its joined processing for node N0. The same process
is repeated for VDM subsystem 50 and SFS subsystem 52 of node N0 (Fig. 6).
After having joined the cluster at the kernel level, node N0 proceeds to user space (Figs. 7a-7c). Node N0's INIT
subsystem (not shown) spawns event manager subsystem 54, which spawns a thread which returns immediately
because the event manager 54 of node N0 has not yet processed node N0's graceful join. After having processed
node N0's graceful join, node N0's event manager 54 marks node N0's state as joined and informs node N0's
membership manager 46 that it has processed node N0's graceful join.

In Fig. 8, the administrator powers and boots node N1, which causes nodes N0 and N1 to perform a graceful join
of node N1. Node N1 opens the membership database 34, retrieves its configuration information and initializes its
kernel subsystems 46-52, which register for transition notifications. The membership managers of joining nodes must
negotiate with the cluster master node in order to join the cluster. When there are multiple nodes in the cluster, one
node becomes the master node utilizing Dekker's algorithm, which is known in the art. The master node writes its
heartbeat in a particular area of membership database 34. Joining nodes will examine such area for the heartbeat in
order to identify the master node. Membership manager subsystem 46 of node N1 negotiates with the membership
manager of node N0, which must be the master node because it is the only node in the cluster, in order to join the
cluster. The membership managers of nodes N0 and N1 mark the state of node N1 as joining the cluster. The
membership managers of nodes N0 and N1 notify their respective DLM subsystems 48 that node N1 is joining the cluster.
Both DLM subsystems wake up from their calls to begin processing node N1's graceful join. After both DLM subsystems
have coordinated in processing node N1's graceful join, the DLM subsystems mark node N1's state as joined and
acknowledge to the membership manager. After having received both DLMs' acknowledgements, the membership
managers of nodes N0 and N1 notify the respective VDM subsystems 50 that node N1 is joining the cluster (Fig. 9).
After both VDM subsystems have processed node N1's join, both subsystems mark node N1's state as joined and
acknowledge the same to their respective membership managers.

After having joined the cluster at the kernel level, node N1 proceeds to user space with its INIT subsystem (not
shown) spawning event manager 54. Node N1's event manager registers itself with the membership managers. Node
N1's event manager spawns a thread that makes a kernel call which is returned immediately because node N1's event
manager must process node N1's graceful join. Node N0's event manager wakes up to process node N1's graceful
join. After having coordinated to process node N1's graceful join, both event managers 54 mark node N1's state as
joined and acknowledge the graceful join to their respective membership managers. Node N1 is joined as illustrated
in Fig. 10.

A node may initiate a graceful leave while the node is still in the joining state. However, a joining subsystem will
not convert the joining state directly to a leaving or an inactive state. The joining subsystem must complete and
acknowledge the joined transition. The membership manager will only reverse the joining state to the leaving state
between notifications to registered subsystem levels.

The processing, by transition notification framework 44, of a graceful leave of a node, such as would occur
during a shutdown of a node, is illustrated by reference to Figs. 11 and 12. Node N0 initiates its shutdown by making
the appropriate call to initiate a graceful leave. The membership manager subsystems of nodes N0 and N1 mark node
N0 as leaving. The membership managers of nodes N0 and N1 wake up both event managers 54 with node N0's
transition. Both event managers note node N0's state as leaving and begin their coordinated processing of node N0's
graceful leave. As will be described in more detail below, the processing of node N0's graceful leave by both event
managers may involve a considerable amount of application level shutdown, after which both event managers mark
node N0 as inactive and notify their respective membership managers. Node N0's membership manager automatically
de-registers node N0's event manager for transition notification, whereby node N0's event manager will receive no
further notifications. Next, the membership managers of nodes N0 and N1 perform the same iteration with the SFS
subsystems 52 of both nodes, then with both VDM subsystems 50, and then DLM subsystems 48. Finally, the
membership managers of both nodes mark node N0 as inactive, which also is the end of node N0's graceful leave. Node N0
performs kernel level shutdown processing and returns to the boot command line.

It should be noted that a node may not initiate a graceful join while other nodes are processing the node's graceful
leave. In practice, this situation can occur when the leaving node has died abnormally and re-boots before other nodes
have had a chance to notice that the leaving node has died. As soon as the other nodes notice that the leaving node
has actually died, the other nodes will force the dead node out of the cluster, aborting their graceful leave processing.
The other nodes will subsequently accept the new node's graceful join request.

An ungraceful, forced leave is an abnormal situation: for example, when a node is no longer capable of
communicating with the other cluster nodes. Once the forced-out node notices that the other nodes have forced that node out
of the cluster, the forced-out node panics. Transition notification framework 44 ensures that the forced-out node does
not corrupt any shared-cluster resources. When a registered subsystem is in the middle of processing a graceful join
or leave of the forced-out node, each node's membership manager could re-notify the processing subsystem to abort
its graceful processing for the node and begin recovery processing.

An example of a forced leave is illustrated with respect to Figs. 13-15, which begins with node N1 being joined to
the cluster and node N0 booting and joining the cluster gracefully. If, by way of example, the membership managers
of nodes N0 and N1 have processed node N0's graceful join up through the VDM subsystems 50, but have not received
an acknowledgement from the SFS subsystems 52 of their completion of node N0's graceful join processing because the
SFS subsystems of nodes N0 and N1 cannot communicate due to an interconnect failure on node N0, node N0's
graceful join is noted as in a hung state (Fig. 13). The membership manager of node N1 notices that node N1 can no
longer communicate with node N0. Node N1 forces node N0 out of the cluster by marking node N0's state as
forced-leaving. The membership manager of node N0 notices that node N1 has forced out node N0 and panics immediately
(Fig. 14). The membership manager of node N1 initiates forced-leave processing starting from the lowest level
subsystem and proceeding up to the highest registered subsystem. The DLM subsystem 48 of node N1 marks node N0
as forced-leaving, notices the abrupt transition from joined and begins recovery processing, as illustrated in Fig. 14.
After the DLM subsystem of node N1 acknowledges its completion of recovery processing for the forced-leave of node
N0 by marking node N0's state as inactive, the membership manager of node N1 performs the same iteration with
respect to VDM subsystem 50.

Node N0 may not re-join the cluster gracefully until all of node N1's subsystems have completed their processing
of node N0's forced leave. When the membership manager of node N1 has finally caught up with SFS subsystem 52
of node N1, this subsystem will abort its processing of node N0's original graceful join and will perform recovery
processing followed by an acknowledgement of its completion of the forced leave processing for node N0 by marking node
N0's state as inactive. The membership manager of node N1 would normally continue iterating the forced-leave
notification through the highest registered subsystem, event manager 54. However, because SFS subsystem 52 was the
highest subsystem to be notified of node N0's graceful join attempt, forced-leave processing will progress only through
the SFS subsystem. After processing the forced-leave notification through the highest appropriate subsystem, in this
case the SFS subsystem, the membership manager of node N1 marks the state of node N0 as inactive (Fig. 15).
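The scope of forced-leave processing in this example can be illustrated with a short Python sketch. It assumes a simple list of level names ordered from lowest to highest; `LEVELS` and `forced_leave_levels` are hypothetical names introduced only for illustration.

```python
# Registered subsystem levels, lowest to highest (assumed names).
LEVELS = ["DLM", "VDM", "SFS", "EVENT_MANAGER"]


def forced_leave_levels(highest_notified, levels=LEVELS):
    """Levels that receive the forced-leave notification, bottom-up.

    highest_notified is the highest level that had been notified of the
    node's interrupted graceful join. Levels above it were never told
    about the node, so they are skipped.
    """
    stop = levels.index(highest_notified)
    return levels[: stop + 1]
```

In the example of Figs. 13-15, the join had reached the SFS level, so `forced_leave_levels("SFS")` covers the DLM, VDM, and SFS levels but not the event manager.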
Table 1 illustrates, for a given transition node, the types of notifications that the membership manager will send to
a registered subsystem and the corresponding acknowledgements that the membership manager expects to receive
of the registered subsystem after the registered subsystem has completed its processing of the transition. Table 1 also
lists the re-notifications that the membership manager may send to the registered subsystem while the registered
subsystem is still processing the original notification for the transitional node.
Some registered subsystems may need to perform a two-or-more-phase commitment operation for one of the
particular node transitions. In order to provide such multiple phase commitment, membership manager 46 provides
barrier synchronization. Each registered subsystem may specify a number of barriers the subsystem wants for each
type of node transition. The membership managers then provide notifications that are barrier-synchronized with
subsystem levels. All nodes at a given subsystem level must acknowledge their completion of the transition processing for
the particular barrier before the membership manager will proceed to the next-in-order subsystem. For example, if the
DLM subsystem asks for two "joining" barriers during a joining transition, all DLM subsystems must acknowledge joining
barrier 0 before they will be notified of barrier 1. After they acknowledge barrier 1, the joining transition will propagate
to the VDM subsystem, which may have a different number of barriers. Also, all subsystems at a particular level must
register with the same number of barriers for each type of transition.
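The barrier-synchronized progression can be sketched in Python as follows. This is a single-process illustration under assumptions, not the mechanism of the embodiment: `run_transition` and the `ack` callback are hypothetical, and a level with no registered barrier count is assumed to use one barrier.

```python
def run_transition(levels, barriers, nodes, ack):
    """Drive one transition through the levels in the given order.

    levels   -- subsystem names, already in notification order
    barriers -- dict: subsystem name -> number of barriers it registered
    nodes    -- identifiers of the nodes presently in the cluster
    ack      -- callable(node, level, barrier) -> bool, True once that
                node's subsystem has finished processing the barrier
    Returns the list of (level, barrier) steps that completed.
    """
    completed = []
    for level in levels:
        for barrier in range(barriers.get(level, 1)):
            # Every node's subsystem at this level must acknowledge this
            # barrier before any further notification is sent.
            if not all(ack(node, level, barrier) for node in nodes):
                return completed  # the transition stalls at this barrier
            completed.append((level, barrier))
    return completed
```

With two joining barriers for the DLM level and one for the VDM level, the completed steps are DLM barrier 0, DLM barrier 1, and then VDM barrier 0, matching the example in the text.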
A state diagram for transition notification framework 44 is shown in Fig. 2 to illustrate the types of notifications
that an individual registered subsystem may receive for a node transition. As may be observed in Fig. 2, transitioning
may proceed through multiple barriers for each transition type. For clarity, only one barrier is illustrated for
forced-leaving. However, multiple barriers are allowed. Dashed lines represent transition notifications. Solid lines represent
acknowledgements from the individual subsystem on a single node or across all nodes.
MEMBERSHIP MANAGER NOTIFICATIONS OF TRANSITIONS FOR A PARTICULAR NODE TO A REGISTERED
SUBSYSTEM AND EXPECTED ACKNOWLEDGEMENTS FROM THE REGISTERED SUBSYSTEM

TABLE 1

Sent Notification    New State After All   Allowed Re-Notifications   Notes
(from MM to          Nodes Have            (from MM to subsystem)
subsystem)           Acknowledged
-------------------  --------------------  -------------------------  ----------------------------------------
Joining              Joined                Forced-Leaving             Straightforward, but this subsystem must
                                                                      keep a lookout for the Forced-Leaving
                                                                      re-notification and abort its graceful
                                                                      join in a timely fashion.

Leaving              Inactive              Forced-Leaving             Straightforward, but this subsystem must
                                                                      keep a lookout for the Forced-Leaving
                                                                      re-notification and abort its graceful
                                                                      leave in a timely fashion.

Forced-Leaving       Inactive              None                       Causes this subsystem to abort any
                                                                      graceful join or leave processing for
                                                                      the node.

Inactive             Inactive              Joining                    MM should never send only an Inactive
                                                                      notification to this subsystem. MM sends
                                                                      this state along with real transition
                                                                      notifications for other nodes.

Joined               Joined                Forced-Leaving             MM should never send only a Joined
                                                                      notification to this subsystem. MM sends
                                                                      this state along with real transition
                                                                      notifications for other nodes.
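The rows of Table 1 lend themselves to a small lookup structure. The following Python sketch encodes the notification, the state after all nodes have acknowledged, and the allowed re-notification for each row; the names `TABLE_1`, `state_after_ack`, and `may_renotify` are hypothetical, and the first row is read as the Joining notification.

```python
# Sent notification -> (new state after all nodes acknowledge,
#                       allowed re-notification, or None if there is none)
TABLE_1 = {
    "Joining": ("Joined", "Forced-Leaving"),
    "Leaving": ("Inactive", "Forced-Leaving"),
    "Forced-Leaving": ("Inactive", None),
    "Inactive": ("Inactive", "Joining"),
    "Joined": ("Joined", "Forced-Leaving"),
}


def state_after_ack(notification):
    """State recorded once every node has acknowledged the notification."""
    return TABLE_1[notification][0]


def may_renotify(current, new):
    """True if the membership manager may re-notify with `new` while the
    subsystem is still processing the `current` notification."""
    return TABLE_1[current][1] == new
```

For instance, a subsystem still processing a Joining notification may be re-notified with Forced-Leaving, whereas a Forced-Leaving notification admits no re-notification.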

B. EVENT MANAGER 



Event manager subsystem 54 is a user space subsystem which provides cluster-wide availability to client services.
This latter function is performed by a registration and launch service 56 (Figs. 16-21) which is a component of event
manager subsystem 54 (Fig. 21). Event manager subsystem 54 includes an event manager daemon 58 having multiple
watchers 60a-60g which monitor for particular conditions. If a watcher detects a problem, the event manager subsystem
54 will resolve the problem via action functions 62. Registration and launch service 56 may be considered a watcher
of event manager daemon 58, but performs additional useful functions as will be explained in more detail below.

A client service is any computing activity, including user applications, which is performed on one node or on more than
one node in a cluster. One difficulty is determining which node or nodes should initially provide each client service.
Additionally, there must be coordination of which node implements which services during a failure scenario. If individual client
services were to implement their own mechanisms for determining the best location to execute, and for initiating recovery
if the node currently providing the services leaves the cluster, a heavy burden would be placed upon the administrative
management of the cluster. Registration and launch service 56 provides cluster awareness to non-cluster-aware
applications by choosing which node a client service will execute on, registering the client service with that node, and notifying
nodes presently in the cluster that the particular client service is registered with that node. Registration and launch service
56 additionally will provide an optional launching, or execution, capability, which is invoked when the service is registered
at a particular node. The launching capability can additionally be used to transfer a service from one node to another in a
controlled fashion. The launching capability, referred to as "action functions," enhances cluster-wide availability by providing
the ability to initiate, migrate, and terminate client services. Such transfers of the client service may result, for example,
from nodes joining and leaving the cluster. Examples of services which may make principal use of the registration and
launch service include, by way of example, printing subsystems, floating internet protocol (IP), networking services, and
license server, although other services may advantageously make use of the registration and launch service.

1. Registration 

Registration is the process where one node, or more than one node, in a cluster can claim, or is (are) assigned,
responsibility for a previously defined client service. It is performed on a cluster-wide atomic basis. A registration
indicates a claim of responsibility that the registered node is fulfilling the obligations of the specified client service.
As a consequence of registration, or de-registration, optional service start-up, notification, and shut-down commands,
known as "action functions", will be invoked. In this manner, registration may initiate, or launch, the client service and
provide cluster awareness to the user service. Cluster awareness is a result of notification, upon completion of
registration, to other nodes in the cluster, as well as nodes subsequently joining the cluster, of the registration.

2. Choosing 

The choice of which node is assigned a particular responsibility, as part of a registration operation, is guided by a
set of choosing parameters, or database items, 64. These choosing parameters may include a set of database items
which specify when, where, and under what conditions a service should be registered. However, additional criteria may
be included in the choosing function, including recent performance statistics of particular nodes. Administrator-supplied
priority factors may be selected as follows:

Allowable_Nodes - The nodes from the cluster where registration is allowed. All nodes must be potential members
of the cluster although they need not be powered up. A single wildcard character may be utilized to designate all
potential nodes of the cluster.

Node Preferences - Node preferences result from the fact that not all nodes will support all client services equally
well. Node preferences may be specified as an unordered list or as an ordered list. Selection among unordered
members will be influenced by recent performance characteristics of the cluster. Ordered lists are processed
beginning with the highest rank member.

Disallowable_Nodes - The nodes from the cluster where registration is not allowed. Adding a node to a client
service's Disallowable_Nodes field does not automatically initiate a transfer of the service.

Auto_Register - This is used when the cluster is first powered up, wherein each user client service potentially
needs to be registered and started. The Auto_Register field allows the administrator to define under what conditions
a service should be registered.

Placement_Policy - This indicates what type of registering philosophy is in place; namely, whether the client
service is to be registered on exactly one node or is to be registered and started on every allowable node.
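
The way these parameters could combine into a single choice can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ChoosingParams` class, the `choose_node` function, the `"*"` wildcard spelling, and the use of a per-node load figure as the "recent performance" input are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ChoosingParams:
    """Hypothetical in-memory form of the choosing parameters 64."""
    allowable_nodes: Set[str]                      # {"*"} stands for every potential node
    disallowable_nodes: Set[str] = field(default_factory=set)
    node_preferences: List[str] = field(default_factory=list)
    preferences_ordered: bool = True               # ordered vs. unordered preference list


def choose_node(params: ChoosingParams,
                up_nodes: Set[str],
                recent_load: Dict[str, float]) -> Optional[str]:
    """Return the node to register on, or None if no eligible node is up."""
    if "*" in params.allowable_nodes:
        eligible = set(up_nodes)
    else:
        eligible = up_nodes & params.allowable_nodes
    eligible -= params.disallowable_nodes
    if not eligible:
        return None

    preferred = [n for n in params.node_preferences if n in eligible]
    if preferred and params.preferences_ordered:
        # Ordered list: take the highest-ranked preferred node that is up.
        return preferred[0]
    # Unordered list (or no preferred node up): let recent performance decide.
    # Assumption: a lower load figure is better; ties are broken by node name.
    candidates = preferred or sorted(eligible)
    return min(candidates, key=lambda n: (recent_load.get(n, 0.0), n))
```

For example, with allowable nodes N0 and N1 and N0 preferred, the sketch picks N0 whenever it is up and falls back to N1 otherwise.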

As illustrated in Fig. 18, a client service may be started from either an auto-start 66 or an external start 68. The
registration and launch service of a decision-making node selects the best node at 64 utilizing choosing parameters as
previously set forth. The decision-making node can be any node in the cluster; it is determined by possession of a file
lock and is processed using a cluster-wide semaphore. If maturity applies to the client service, the registration and launch
service transfers to a state 70 awaiting maturity. Maturity refers to the maturity of the cluster. A cluster is mature with
respect to a given client service if at least one node from the allowable_nodes list is up and (a) the primary node is
available, (b) enough nodes are up, or (c) enough time has gone by. Once the cluster is mature (70), or if maturity does
not apply, the registration and launch service notifies (72) the selected node to start the service and the other nodes
that the client service has been assigned to a particular node. The client service is then started or launched (74). Each
registration and launch client service has independent choosing factors, except as described below (grouping).
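
The maturity test described above can be written as a single predicate. The sketch below is illustrative only: the function name and signature are assumptions, as are the choices to count only allowable nodes toward "enough nodes" and to measure elapsed time in the same units as `maturity_time`.

```python
from typing import Iterable, Optional


def cluster_is_mature(allowable_nodes: Iterable[str],
                      up_nodes: Iterable[str],
                      primary_node: Optional[str],
                      maturity_count: int,
                      maturity_time: float,
                      elapsed: float) -> bool:
    """Return True if the cluster is mature with respect to one client service."""
    up_allowable = set(allowable_nodes) & set(up_nodes)
    if not up_allowable:
        # At least one allowable node must be up, whatever else holds.
        return False
    return (primary_node in up_allowable            # (a) the primary node is available
            or len(up_allowable) >= maturity_count  # (b) enough nodes are up
            or elapsed >= maturity_time)            # (c) enough time has gone by
```

With allowable nodes N0 and N1, a count of 2, and a time of 5, a lone N0 becomes sufficient only once five time units have elapsed.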


3. Grouping 

Registration and launch service 56 allows client services to be associated with each other in an association known
as a grouping 76 (Figs. 16 and 17). The grouping mechanism is a relationship between a parent client service 78 and
one or more child client services 80. The purpose of this grouping arrangement is to allow the administrator to specify
associations where specific services must be placed together. The child will be placed wherever the parent is placed.
Children services do not have any choosing factors; only the parents' choosing factors are used. A grouping 76' may
include child services 80 that are children of another service 80', which is, in turn, the child of parent service 78, as
illustrated in Fig. 17. All of the children (80 and 80') would be under the placement of parent service 78.

The registration for each child 80 is pended until its parent is successfully registered and its start_command, if
any, has successfully completed. After the parent completes, each child is processed. Similarly, a de-registration of a
parent implies the de-registration of the children. Children are de-registered first, with stop_commands invoked as
appropriate. When transferring a grouping 76, grouped children 80 are always stopped first and started last.
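
The ordering rule for a grouping (parents start first, children stop first) amounts to a pre-order walk of the grouping tree and its reverse. The sketch below assumes a hypothetical representation in which `children` maps each service name to the list of its direct children; the function names are likewise illustrative.

```python
from typing import Dict, List


def start_order(parent: str, children: Dict[str, List[str]]) -> List[str]:
    """Order in which a grouping is registered and started: parent before its children."""
    order = [parent]
    for child in children.get(parent, []):
        order.extend(start_order(child, children))
    return order


def stop_order(parent: str, children: Dict[str, List[str]]) -> List[str]:
    """Order in which a grouping is stopped and de-registered: children before parents."""
    return list(reversed(start_order(parent, children)))
```

A transfer of the whole grouping would run `stop_order` on the old node and then `start_order` on the new one, so children are stopped first and started last.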

4. Action Functions 

Registration and launch service 56 supports a variety of actions to take place as a consequence of registration
transactions. Actions are the "launch" aspect of the registration and launch service. When combined with groupings
76 and choosing parameters 64, actions provide that many client services can depend entirely on the registration and
launch service for cluster-wide availability. Initiation, migration, and termination may all be carried out directly with
registration and launch service 56.

Registration and launch service 56 in the illustrated embodiment includes four action functions: 


Start_Command - The database is checked to determine if such command is associated with the service upon
successful registration of the client service. If such command is present, the client service is executed on the
registered node. The registration is not complete until the start has completed successfully. If a start operation
fails, an attempt is made to start the client service on another node in the allowable_nodes list.

Stop_Command - The database is checked to determine the presence of this command when a client service
intends to terminate. The de-registration is not complete until the stop_command terminates.

Notify_Command - This command provides a mechanism whereby other nodes are informed that the client service
has been assigned to a particular node. When a client service is successfully registered, the database is checked
to determine if this command is associated with the service. If it is, it is executed on all nodes in the allowable_nodes
list except the registered node. If a node in the allowable_nodes list joins the cluster after a service is registered,
and the service has a notify_command, the command is initiated on the new node. This includes nodes which leave
and subsequently rejoin the cluster. If there is a start_command, the notification is pended until successful start.

Recovery_Command - This is used when a node ungracefully leaves the cluster. For each service registered on
the forced out node, the database is checked to determine if there is a recovery_command associated with the
service. If there is, it is executed. The node for the recovery operation is determined using the choosing parameters
64. When the recovery completes, the service is de-registered. Typically, the service will then be registered and
started on one of the surviving nodes.

A concept closely related to an action function is that of transfer. Transfer of a service is accomplished through a
combination of two action functions. First, the service is located and de-registered. Second, it is registered and started.
A service transfer may be very helpful under various circumstances. In one circumstance, the administrator may wish
to move a service. In another circumstance, the transfer function is used to transfer all of the services for a node that
is being gracefully shut down. In another circumstance, the service, by its nature, may be trivial to move. Because
there is no impact to moving it, such service may be automatically transferred when a preferred node joins the cluster
if the service is placed on a node other than a preferred node.

When a client service is in the process of being transferred from one node to another, a "transfer intent" flag is set
by the transferring node. The effect of the transfer intent flag on the transitions of registration and launch service 56
may be seen by reference to Fig. 19. Registration and launch service 56 includes a starting state 82, a registered state 84,
a stopping state 86, a de-registered state 88, and a recovering state 90. Each of the stopping, starting, and recovering
states will be skipped if its respective command does not exist. The starting state 82 indicates that the service is in
the process of starting. If start is successful, the service goes to a registered state 84, which indicates that the service
claims to be operational on some node. The registered service transfers to stopping state 86 if the transfer intent flag
is set as part of transferring to another node in a graceful leave of the operational node. The registered service also
transfers to stopping state 86 as part of an external de-register or a transfer operation. Upon completion of the stopping
command, the service transitions from the stopping state 86 to the de-registered state 88, indicating that the service is
currently not registered and that no node transitions are currently underway. A service being transferred normally
proceeds immediately from de-registered state 88 to starting state 82. Recovering state 90 only occurs if a node
ungracefully leaves the cluster while the service was in the registered state 84 or in the stopping state 86. A more
detailed state transition diagram is illustrated in Fig. 20, illustrating various intermediate states.
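
The five states and the rule that a state is skipped when its command does not exist can be captured in a small transition table. This is a simplified sketch of the Fig. 19 level of detail only: the event names (`register`, `start_done`, and so on), the `step` function, and the representation of a service's defined commands as a set of strings are assumptions for illustration, and the transfer intent flag and the intermediate states of Fig. 20 are not modeled.

```python
from enum import Enum
from typing import Set


class State(Enum):
    STARTING = 82
    REGISTERED = 84
    STOPPING = 86
    DE_REGISTERED = 88
    RECOVERING = 90


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.DE_REGISTERED, "register"): State.STARTING,
    (State.STARTING, "start_done"): State.REGISTERED,
    (State.REGISTERED, "deregister"): State.STOPPING,
    (State.STOPPING, "stop_done"): State.DE_REGISTERED,
    (State.REGISTERED, "node_forced_out"): State.RECOVERING,
    (State.STOPPING, "node_forced_out"): State.RECOVERING,
    (State.RECOVERING, "recovery_done"): State.DE_REGISTERED,
}

# States that exist only while a command runs, with the event that ends them.
COMMAND_FOR = {
    State.STARTING: ("start_command", "start_done"),
    State.STOPPING: ("stop_command", "stop_done"),
    State.RECOVERING: ("recovery_command", "recovery_done"),
}


def step(state: State, event: str, commands: Set[str]) -> State:
    """Apply one event, skipping any state whose command is not defined."""
    nxt = TRANSITIONS[(state, event)]  # KeyError means the event is not valid here
    while nxt in COMMAND_FOR and COMMAND_FOR[nxt][0] not in commands:
        nxt = TRANSITIONS[(nxt, COMMAND_FOR[nxt][1])]
    return nxt
```

A service with no start_command therefore moves from de-registered directly to registered, and a forced-out node's service with no recovery_command moves directly to de-registered.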

Table 2 illustrates an example of state transition in a three-node cluster having nodes N0, N1, and N2. The example
is based upon the allowable_nodes being node N0 and node N1, with a maturity_count equal to 2 and a maturity_time
equal to 5. Auto_register is set to auto. The example applies to a single client service. At the beginning of the example,
all nodes are down. At time t1, node N2 boots so that the status of the node changes to an up condition. No change
occurs for the service because node N2 is not in the allowable_nodes list. At time t2, the service is awaiting cluster
maturity and node N0 boots. At time t3, the maturity_time lapses and the service start command is executed on node
N0. Start completes at time t4 and the service is registered on node N0. At time t5, node N1 boots. Because the service
is already registered, no action occurs. However, if a "notification action" is defined, it would be executed on node N1
at this time. At time t6, node N2 gracefully leaves the cluster, which also has no influence on the service. At time t7,
node N0 begins a graceful leave. This results in the service being in the stopping state with the transfer intent set.
Stopping is completed at time t8. The service is transferred to node N1 in the starting state at time t9. The service
becomes registered on N1 at time t10 when start completes. At time t11, node N0 completes the graceful leave and
goes down. This does not represent a change in the database because the up/down status of a node is not a state
maintained in the database. At time t12, node N1 begins a graceful leave which places the service in a stopping state.
When stop completes at time t13, the service becomes de-registered and the transfer intent flag is cleared.



TABLE 2 
In an example illustrated in Table 3, node N0 is defined as a preferred node with the remaining nodes as allowable.
The maturity_count is set to 2. Auto_register is set to auto. Placement_policy is set to register_on_one. Transfer_cost
is free. At the beginning of the example, the three nodes, N0, N1, and N2, in the cluster are all down. At time t1, node
N0 boots and the service enters the starting state on node N0. At time t2, start completes and the service is registered
on node N0. At time t3, node N1 boots and is notified that the service is registered on node N0. Between times t4 and
t7, node N0 gracefully leaves the cluster. The service transfers to node N1 and is registered there. At time t8, node N2
boots and is notified that the service is registered on node N1. At time t9, node N0 boots and is notified that the service
is registered on node N1. At time t10, shortly after it boots, node N0 notices that free transfers are allowed and that it is
preferred over node N1. Node N0 will automatically initiate a transfer. The service enters the stopping state 86 and
stop completes at time t11. At time t12, the service enters the starting state on node N0. Starting is complete at time t13
and the service is registered on node N0. Nodes N1 and N2 are notified that the service is registered on node N0.
STATE TRANSITION EXAMPLE B

5. Event Manager Daemon

Registration and launch service 56 will automatically request event manager daemon 58 to monitor for a condition
relating to a client service. Event manager daemon 58 responds by defining an event_group. The registration and
launch service will request the event manager daemon to enable this event_group upon registration of the service and
disable the event_group in response to de-registration of the service. If event manager daemon 58 detects a problem,
an event action 62 will be invoked to resolve the problem. No direct communication will be returned to the registration
and launch service.

Table 4 illustrates the fields associated with an event monitored by event manager daemon 58. In addition to the
event name field, there are fields for IN parameters, which define the event that the appropriate event watcher is set
to detect, and OUT parameters, which are filled in when the event occurs. The output is made available
to the action function 62 associated with the particular event. Event_groups are used to logically associate otherwise
independent events in order to specify when, where, and under what conditions to enable them.



TABLE 4

EVENT MANAGER DAEMON

Field              Description

Event Name         A string that identifies an event instance; it is unique within the cluster.

IN Parameters      A fixed set of name-value pairs that define an event; they are used by the appropriate
                   event watcher to detect the event.

OUT Parameters     A fixed set of name-value pairs that describe an occurrence of the event.

Action Function    This command line describes what happens if the event occurs; it may reference values
                   from the IN and OUT parameters.
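
The four fields of Table 4 map naturally onto a small record whose action command line is expanded from the IN and OUT values. The text does not say how an action function references those values, so the `$name` placeholder syntax (Python's `string.Template`), the `Event` class, and the `restart` command in the usage note are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Dict


@dataclass
class Event:
    """Hypothetical record holding the Table 4 fields for one event."""
    name: str                      # unique within the cluster
    in_params: Dict[str, str]      # define the event; read by the watcher
    action: str                    # command-line template for the action function
    out_params: Dict[str, str] = field(default_factory=dict)  # filled in on occurrence

    def action_command(self) -> str:
        """Expand the action template using both IN and OUT values."""
        values = {**self.in_params, **self.out_params}
        return Template(self.action).substitute(values)
```

For instance, an event created with `in_params={"process": "lpd"}` and `action="restart $process --reason $exit_code"` would expand to a complete command line once the watcher has stored an `exit_code` entry in `out_params`.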



In event manager subsystem 54, event manager daemon 58 is the center of control. All watchers 60a-60g connect
via a communication library to the event manager daemon. One of the watchers provided in event manager subsystem
54 is membership manager watcher 60e, which receives notifications from membership manager subsystem 46 of
node transitions in the manner previously described and provides an interface to transition notification framework 44.
Event manager subsystem 54 provides awareness to registration and launch service 56 of such node transitions.

One example of an application for which registration and launch service 56 is especially apropos is to provide a
floating license server on cluster system 25. The licensed software could be established as a service and could be
allowed to execute on a given number of nodes in the cluster. The registration and launch service will run the start
program that brings up the licensed software on one of the nodes. If that node goes down gracefully, or ungracefully,
the registration and launch service will transfer the licensed software to a new node, after recovery if the leave was
ungraceful.

Thus, it is seen that the present embodiment provides a tightly coordinated cluster membership manager framework 
which coordinates the joining or leaving among all nodes in a cluster, including taking the multiple layers of involved 
subsystems through the transitions. One of the subsystems may be in user space and carries out the transfers of client 
services, including user applications, resulting from nodes joining and leaving the cluster. Other user space applications
may register with the membership manager transition notification framework at run time. Thus, a robust system is 
provided which enhances the high aggregate performance of the multiprocessor cluster technology. 

The present embodiment facilitates the use of multiprocessor cluster systems with operating systems having mul- 
tiple subsystems which are layered by taking all of the involved subsystems through node transitions. It also brings 
cluster awareness to non-cluster-aware client services, which include a wide variety of computing activities including 
user applications. This allows users to treat the cluster system as a single unit with the cluster system providing cluster- 
wide availability to the client service, including initiation of the client service on a particular node, migration of the client 
service between nodes, and termination of the client service. 

Changes and modifications in the specifically described embodiments can be carried out without departing from
the principles of the invention, which is intended to be limited only by the scope of the appended claims. 

Claims 

1. In a multiprocessor system having multiple nodes, a shared resource accessible to all nodes and multiple
subsystems on each of said nodes, a method of combining particular ones of said nodes in a cluster that appears
substantially as a unified system to users of said system, including notifying subsystems running on nodes presently
in the cluster of transitions of nodes joining and leaving the cluster in order to provide a consistent view of active
membership in the cluster.

2. The method of claim 1 wherein the subsystems are interdependent in levels, with higher level subsystems
dependent on the operation of lower level subsystems.

3. The method of claim 2 including notifying one of said subsystems on all of said nodes presently in the cluster of
a transition, processing that notification at said one of said subsystems prior to notifying another of said subsystems
on all of said nodes presently in the cluster of the transition.

4. The method of claim 3 including notifying subsystems beginning with lower level subsystems and proceeding in 
sequence through higher levels of subsystems of a transition of a node joining the cluster. 

5. The method of claim 3 including notifying subsystems beginning with higher level subsystems and proceeding in
sequence through lower levels of subsystems of a transition of a node gracefully leaving the cluster.

6. The method of claim 3 including notifying subsystems beginning with lower level subsystems and proceeding in
sequence through higher levels of subsystems of a transition of a node forced from the cluster by other processors.


7. The method of claim 2 wherein said subsystems include a higher level subsystem which interacts with user pro- 
grams. 

8. The method of claim 7 wherein said higher level subsystem includes a service which automatically and atomically
transfers user programs to other nodes when the node executing the user programs leaves the cluster.

9. The method of claim 2 wherein said subsystems include a distributed lock manager subsystem, a virtual disk 
manager subsystem and a shared file subsystem. 

10. The method of claim 1 wherein for a transition of a node joining the cluster, said method includes the steps of:

a) registering subsystems of the joining node to receive transition notifications; 

b) joining the node to the cluster; and 

c) notifying registered subsystems in the cluster that the joining node has joined the cluster.

11. The method of claim 1 wherein for a transition of one node being forced out of the cluster by another node, said
method includes the steps of:

a) the another node notifying registered subsystems that the one node is being forced out of the cluster; and

b) transferring registered programs executing on said one node atomically to a different node and recovering
the programs to execute on said different node.

12. In a multiprocessor system having multiple nodes, a shared resource accessible to all nodes and multiple
subsystems on each of said nodes, a method of initiating client services on particular ones of said nodes in a cluster in
a manner that appears substantially as a unified system to the client services, including choosing a node for each
client service, registering the client service with that node, and notifying nodes presently in the cluster that the
particular client service is registered with the particular node, whereby the particular service can be transferred to
another node if the particular node leaves the cluster.

13. The method of claim 12 further including launching a client service on a node according to an action parameter
included with the client service in response to registering that client service with that node.

14. The method of claim 13 further including grouping client services as a parent client service and at least one child
client service, registering grouped client services with the same node and launching grouped client services
according to an action parameter included with the parent client service.

15. The method of claim 12 wherein said choosing a node includes providing a database of choosing factors for the
client service and applying said choosing factors to the nodes presently in the cluster, said choosing factors
establishing rules relating nodes to the client service.

16. The method of claim 15 wherein said choosing factors are selected from the group including allowable nodes,
disallowable nodes and node preferences.

17. The method of claim 12 further including notifying nodes joining the cluster that the particular client service is
registered with the particular node.

18. The method of claim 12 further including monitoring a client service registered with a node at that node using an event
watcher.

19. The method of claim 18 including enabling the event watcher in response to registering the client service and
disabling the event watcher in response to de-registering the client service.

20. The method of claim 12 wherein said multiple subsystems includes a cluster membership manager which controls
[remainder of claim illegible in source]

21. [claim text illegible in source]

22. A multiprocessor cluster system having multiple nodes, a shared resource accessible to all nodes, a cluster
communication medium between said nodes, and multiple subsystems on each of said nodes, comprising:

a cluster membership manager subsystem adapted to notify subsystems running on nodes presently in the
cluster of transitions of nodes joining and leaving the cluster in order to provide a consistent view of active
membership in the cluster;

an event manager subsystem adapted to detect and react to cluster errors; and

a registration and launch subsystem responsive to said event manager and adapted to initiate client services on
particular ones of said nodes in a cluster in a manner that appears substantially as a unified node to the client
services, wherein said registration and launch subsystem chooses a node for each client service, registers
the client service with that node, and notifies nodes presently in the cluster that the particular service is
registered with the particular node.

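23. The multiprocessor cluster system in claim 22 wherein said subsystems are interdependent in levels, with higher
level subsystems dependent on the operation of lower level subsystems.

24. The multiprocessor cluster system in claim 23 wherein said membership manager subsystem notifies one of said
subsystems on all of said nodes presently in the cluster of a transition, and that one of said subsystems on all
of said nodes processes that notification prior to said membership manager subsystem notifying another of said
subsystems on all of said nodes presently in the cluster of the transition.

25. The multiprocessor cluster system in claim 24 wherein said membership manager notifies subsystems beginning
with lower level subsystems and proceeding in sequence through higher levels of subsystems of a transition of a
node joining the cluster.

26. The multiprocessor cluster system in claim 24 wherein said membership manager notifies subsystems beginning
with higher level subsystems and proceeding in sequence through lower levels of subsystems of a transition of a
node gracefully leaving the cluster.

27. The multiprocessor cluster system in claim 24 wherein said membership manager notifies subsystems beginning
with lower level subsystems and proceeding in sequence through higher levels of subsystems of a transition of a
node forced from the cluster.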
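28. The multiprocessor cluster system in claim 22 wherein said registration and launch service launches a client
service on a node according to an action parameter included with the client service in response to registering that client
service with that node.
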
29. The multiprocessor cluster system in claim 28 wherein said registration and launch service is adapted to group
client services as a parent client service and at least one child client service, and wherein said registration and
launch service further registers grouped client services with the same node and launches grouped client services
according to an action parameter included with the parent client service.

30. The multiprocessor cluster system in claim 22 wherein said registration and launch service includes a database 
of choosing factors for the client service and applies said choosing factors to the nodes presently in the cluster to 
choose the node for registering a client service, said choosing factors establishing rules relating nodes to the client
service.

31. The multiprocessor cluster system in claim 22 wherein said registration and launch service further notifies nodes 
joining the cluster that the particular client service is registered with the particular node. 

32. The multiprocessor cluster system in claim 22 wherein said event manager includes an event watcher for monitoring
a client service registered with a node at that node.

33. A computer usable medium in which program code is embodied, said program code defining an operating system
for a multiprocessor cluster system having multiple nodes, a shared resource accessible to all processors, and
including multiple subsystems, one of said subsystems being a cluster membership manager subsystem adapted
to notify subsystems running on nodes presently in a cluster of transitions of nodes joining and leaving the cluster
in order to provide a consistent view of active membership in the cluster.

34. A computer usable medium in which program code is embodied, said program code defining an operating system
for a multiprocessor cluster system having multiple nodes, a shared resource accessible to all processors, and
including a registration and launch service adapted to initiate client services on particular ones of nodes in a cluster
in a manner that appears substantially as a unified system to the client services, wherein said registration and
launch service chooses a node for each client service, registers the client service with that node, and notifies nodes
presently in the cluster that the particular service is registered with the particular node.

35. A computer usable medium in which program code is embodied, said program code defining an operating system
for a multiprocessor cluster system having multiple nodes, and a shared resource accessible to all nodes,
comprising:

multiple subsystems that are interdependent in levels, with higher level subsystems dependent on the
operation of lower level subsystems;

one of said subsystems comprising a cluster membership manager subsystem adapted to notify subsystems
running on nodes presently in a cluster of transitions of processors joining and leaving the cluster in
order to provide a consistent view of active membership in the cluster; and

one of said subsystems including a registration and launch service adapted to initiate client services on
particular ones of said nodes in a cluster in a manner that appears substantially as a unified node to the client
services, wherein said registration and launch service chooses a node for each client service, registers the
client service with that node, and notifies nodes presently in the cluster that the particular service is registered
with the particular node.

36. In a multiprocessor system having multiple nodes, and a shared resource accessible to all nodes, a method of
initiating client services on particular ones of said nodes in a cluster in a manner that appears substantially as a
unified system to the client service including registering a client service with one of said nodes and launching the
client service on that node according to an action parameter included with the client service in response to
registering that user service with that node.


37. The method in claim 36 including transferring the client service to another node if that node leaves the cluster. 

38. The method in claim 36 wherein said transferring includes relaunching the client service on the another node 
according to said action parameter. 
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